Fuzzy kernel evidence Random Forest for identifying pseudouridine sites

Abstract Pseudouridine is an RNA modification that is widely distributed in both prokaryotes and eukaryotes and plays a critical role in numerous biological activities. Despite its importance, precisely identifying pseudouridine sites through experimental approaches is challenging and requires substantial time and resources. Therefore, there is a growing need for computational techniques that can reliably and quickly identify pseudouridine sites from vast amounts of RNA sequencing data. In this study, we propose the fuzzy kernel evidence Random Forest (FKeERF) to identify pseudouridine sites. The resulting method, called PseU-FKeERF, demonstrates high accuracy in identifying pseudouridine sites from RNA sequencing data. The PseU-FKeERF model selects four RNA feature coding schemes with relatively good performance for feature combination, and then inputs the combined features into the newly proposed FKeERF method for category prediction. FKeERF not only uses fuzzy logic to expand the original feature space, but also applies kernel methods, which are generally easy to interpret, for category prediction. Both cross-validation tests and independent tests on benchmark datasets show that PseU-FKeERF has better predictive performance than several state-of-the-art methods. This new method not only improves the accuracy of pseudouridine site identification, but also provides a reference for future disease control and related drug development.

Identifying pseudouridine (Ψ) sites is crucial for understanding the mechanism and other functional roles of Ψ. However, accurately identifying Ψ sites within RNA sequences is challenging because of the limitations of current methods. It would be advantageous to develop computational techniques that reliably predict Ψ sites from RNA sequence data, for example by using machine learning or deep learning methods to build more efficient site recognition models, which may provide important new perspectives on the function of Ψ through rapid and accurate recognition.
Li et al. presented PPUS, the first computational technique, in 2015 [19]. It utilized a support vector machine (SVM) [20,21] and focused on identifying pseudouridine sites specifically in S. cerevisiae and H. sapiens. Building upon this, to predict pseudouridine sites in H. sapiens, S. cerevisiae and M. musculus, Chen et al. developed iRNA-PseU, which integrated nucleotide chemical characteristics with a pseudo nucleotide composition (PseKNC) encoding scheme and was trained with SVM [22]. Extensions to the field include the work by He et al., who developed PseUI by employing five distinct encoding techniques to extract sequence information from RNA segments. To maximize model performance, PseUI, which is also based on SVM, integrated a sequential forward feature selection technique [23]. Tahir et al. presented iPseU-CNN, a two-layer convolutional neural network model that employed one-hot encoding to predict pseudouridine sites [24]. XG-PseU was developed by Liu et al. using a forward feature selection technique in conjunction with extreme Gradient Boosting (XGBoost) [25]. An ensemble learning method called EnsemPseU was introduced by Bi et al. It integrates SVM, XGBoost, Naive Bayes, k-nearest neighbor (KNN) and Random Forest (RF) [26]. Using the incremental feature selection process and the light gradient boosting machine (lightGBM), Lv et al. created RF-PseU, an RF-based technique [27]. Saad et al. presented MU-PseUDeep, a convolutional neural network-based approach that incorporated both sequence and secondary structure features to enhance prediction performance [28]. Li et al. subsequently devised Porpoise, a stacked prediction model that selects features from the top four categories and employs them as inputs for prediction [29]. Additionally, Zhuang et al. introduced PseUdeep, a deep learning framework [30]. Wang et al. established PsoEL-PseU, a feature fusion predictor, for the identification of pseudouridine sites [31]. While computational methods for predicting pseudouridine sites have made significant strides, there are still limitations that must be addressed to develop more robust and accurate approaches, as shown in Supplementary Table S1.
In this study, we identify pseudouridine sites in H. sapiens, S. cerevisiae and M. musculus via the fuzzy kernel evidence Random Forest (PseU-FKeERF) model, as shown in Figure 1. First, we test several commonly used RNA sequence coding schemes and select the best four for feature combination. Then, we use fuzzy c-means clustering and Gaussian fuzzy membership degrees to construct fuzzy feature subsets, expand the original feature space, form a new fuzzy feature set, and input it into the KeERF method for category prediction. Finally, we perform cross-validation and independent testing on the datasets of each of the three species to evaluate the performance of the model. The results show that PseU-FKeERF has better predictive performance than other existing models.

Fuzzy subset
A fuzzy subset is a set that describes a fuzzy concept. Given a domain $U$, a mapping $\mu_A$ from $U$ to the unit interval $[0, 1]$ is called a fuzzy subset of $U$:

$$\mu_A : U \rightarrow [0, 1].$$

At present, there are many methods for obtaining fuzzy subsets. In TSK-FS [32], once the premise of the TSK fuzzy model is determined, the corresponding fuzzy subset is obtained; the process is described as follows.
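First, assume that the training set $X = [x_1, x_2, \ldots, x_m] \in \mathbb{R}^{n \times m}$ contains $m$ samples with $n$-dimensional features, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})^T \in \mathbb{R}^{n \times 1}$. For the TSK fuzzy inference system, the most commonly used fuzzy inference rule $R^a$ is

$$R^a:\ \text{IF } x_{i1} \text{ is } B^a_1 \wedge x_{i2} \text{ is } B^a_2 \wedge \cdots \wedge x_{in} \text{ is } B^a_n,\ \text{THEN } f^a(x_i), \qquad a = 1, 2, \ldots, K,$$

where $K$ represents the number of fuzzy rules, $B^a_j$ is the $a$th fuzzy set of the $j$th input feature and $\wedge$ is a fuzzy conjunction operator. $f^a(x)$ represents the defuzzification function of the local output under the $a$th fuzzy set. The output of the TSK fuzzy model can be formulated as

$$y = \sum_{a=1}^{K} \tilde{\mu}^a(x_i)\, f^a(x_i),$$

where $\mu^a(x_i)$ and $\tilde{\mu}^a(x_i)$ represent the fuzzy membership and the normalized fuzzy membership associated with the fuzzy set $B^a$, respectively, which are calculated by

$$\mu^a(x_i) = \prod_{j=1}^{n} \mu_{B^a_j}(x_{ij}), \qquad \tilde{\mu}^a(x_i) = \frac{\mu^a(x_i)}{\sum_{a'=1}^{K} \mu^{a'}(x_i)},$$

where $\mu_{B^a_j}(x_{ij})$ can be calculated using the Gaussian membership function

$$\mu_{B^a_j}(x_{ij}) = \exp\left(-\frac{(x_{ij} - c^a_j)^2}{2\sigma^a_j}\right),$$

where $c^a_j$ and $\sigma^a_j$ are the center and variance of the $a$th fuzzy set in the $j$th dimension, which can be estimated by a clustering technique. The fuzzy c-means (FCM) algorithm is used to estimate $c^a_j$ and $\sigma^a_j$:

$$c^a_j = \frac{\sum_{i=1}^{m} u_{ia}\, x_{ij}}{\sum_{i=1}^{m} u_{ia}}, \qquad \sigma^a_j = h \cdot \frac{\sum_{i=1}^{m} u_{ia}\, (x_{ij} - c^a_j)^2}{\sum_{i=1}^{m} u_{ia}},$$

where $u_{ia}$ represents the membership value of sample $x_i$ belonging to cluster $a$, and $h$ is the scale parameter, which can be adjusted manually.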
After determining the parameters $c^a_j$ and $\sigma^a_j$, let the output of the $a$th fuzzy rule be $x_e = 1$. Thus, the outputs of the $K$ fuzzy rules can be collected into $x_g$, the final fuzzy subset, which enlarges the original feature space.
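The membership computation described above can be illustrated with a minimal Python sketch. It assumes that the cluster centers and widths have already been estimated (for example by FCM) and that the Gaussian membership has the common form exp(-(x - c)^2 / (2σ)). Treating each cluster's per-feature memberships as one fuzzy feature subset and concatenating the subsets is one plausible reading of the construction, not necessarily the exact procedure used in PseU-FKeERF; all function names are illustrative.

```python
import math

def gaussian_membership(x, center, width):
    """Gaussian membership of a scalar x in a fuzzy set (assumed form)."""
    return math.exp(-((x - center) ** 2) / (2.0 * width))

def fuzzy_subsets(sample, centers, widths):
    """One fuzzy feature subset per cluster: per-feature Gaussian memberships."""
    return [
        [gaussian_membership(x, c, s) for x, c, s in zip(sample, c_row, s_row)]
        for c_row, s_row in zip(centers, widths)
    ]

def normalized_rule_strengths(subsets):
    """Product of memberships per rule, normalized to sum to one."""
    strengths = [math.prod(g) for g in subsets]
    total = sum(strengths)
    return [s / total for s in strengths]

def merge_subsets(subsets):
    """Concatenate the K subsets into one expanded feature vector."""
    return [value for g in subsets for value in g]

# Toy example: two clusters (K = 2) in a two-dimensional feature space.
centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [[1.0, 1.0], [1.0, 1.0]]
subsets = fuzzy_subsets([0.0, 0.0], centers, widths)
features = merge_subsets(subsets)
strengths = normalized_rule_strengths(subsets)
```

With K clusters and n original features, this sketch yields K·n membership features, which shows how the number of clusters controls the size of the expanded feature space.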

Dempster-Shafer evidence theory
Dempster and Shafer developed the Dempster-Shafer (D-S) evidence theory [33,34]. Suppose there is a problem to be decided, the complete set of all possible results is denoted by $\Theta$, and all elements of $\Theta$ are mutually exclusive. The set $\Theta$ is called the recognition frame and can be expressed as

$$\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\},$$

where $\theta_j$ is an element of the recognition frame $\Theta$. The set $2^\Theta$ of all subsets of the recognition frame is expressed as

$$2^\Theta = \{\emptyset, \{\theta_1\}, \ldots, \{\theta_N\}, \{\theta_1, \theta_2\}, \ldots, \Theta\}.$$

After the recognition frame is determined, evidence theory uses the basic trust assignment function to summarize how belief is distributed over all propositions. The basic trust assignment function $m$ on the recognition frame is a mapping $2^\Theta \rightarrow [0, 1]$ that satisfies the following conditions:

$$m(\emptyset) = 0, \qquad \sum_{D \subseteq \Theta} m(D) = 1,$$

where $m(D)$ represents the degree of support of the evidence for proposition $D$, and its value is the proposition's basic trust assignment value. The empty set has a basic trust value of zero, and the trust values of all other subsets sum to one. $D$ is referred to as a focal element if $m(D)$ exceeds 0.
Dempster proposed a method of evidence synthesis in which the basic trust assignment functions of two or more pieces of evidence are combined by an orthogonal sum; this is called Dempster's combination rule. Suppose that under the same recognition frame $\Theta$ there are $n$ pieces of evidence, $m_1, m_2, \ldots, m_n$ are the corresponding basic trust assignment functions, and their focal elements are $D_1, D_2, \ldots, D_n$. Dempster's combination rule is as follows:

$$m(D) = \frac{1}{1 - k} \sum_{D_1 \cap \cdots \cap D_n = D}\ \prod_{i=1}^{n} m_i(D_i), \quad D \neq \emptyset, \qquad k = \sum_{D_1 \cap \cdots \cap D_n = \emptyset}\ \prod_{i=1}^{n} m_i(D_i).$$

This rule becomes unstable when it is used to combine the predictions of a large number of estimators. Therefore, for predictions from a large number of different estimators, a simple average of the mass functions is preferred, which is defined as follows:

$$\bar{m}(D) = \frac{1}{n} \sum_{i=1}^{n} m_i(D).$$
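As an illustration, the following Python sketch implements Dempster's rule for two mass functions and the simple average over several mass functions. Mass functions are represented as dictionaries from `frozenset` propositions to mass values; this representation and the function names are assumptions made for the example, not part of the original method description.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Orthogonal sum of two mass functions with conflict normalization."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to the empty intersection
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {d: w / (1.0 - conflict) for d, w in combined.items()}

def average_masses(masses):
    """Simple average of mass functions, preferred for many estimators."""
    avg = {}
    for m in masses:
        for d, w in m.items():
            avg[d] = avg.get(d, 0.0) + w / len(masses)
    return avg

# Toy frame with two classes "a" and "b".
A, B = frozenset("a"), frozenset("b")
AB = A | B
m1 = {A: 0.6, AB: 0.4}
m2 = {B: 0.5, AB: 0.5}
fused = dempster_combine(m1, m2)
mean = average_masses([m1, m2])
```

In this toy case the conflicting mass is 0.6 × 0.5 = 0.3, so the remaining products are rescaled by 1/(1 − 0.3). The average needs no such rescaling, which is why it remains stable as the number of estimators grows.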

The Jousselme distance approach
Evidence distance is used to represent the similarity between pieces of evidence. Jousselme et al. proposed the Jousselme distance [35] by using the geometric interpretation of evidence theory.
Assuming that the recognition frame $\Theta$ contains $N$ elements, a high-dimensional space can be obtained by taking the elements of $2^\Theta$ as coordinates. Each piece of evidence can be represented as a vector in this space. If $m_f$ and $m_k$ are two independent pieces of evidence in the recognition frame, they are represented as vectors $\mathbf{m}_f$ and $\mathbf{m}_k$. The Jousselme distance between $m_f$ and $m_k$ is defined as

$$d_J(m_f, m_k) = \sqrt{\frac{1}{2}\,(\mathbf{m}_f - \mathbf{m}_k)^T D\, (\mathbf{m}_f - \mathbf{m}_k)},$$

where $D$ is a $2^N \times 2^N$ symmetric matrix whose elements are

$$D(A_i, B_j) = \frac{|A_i \cap B_j|}{|A_i \cup B_j|}.$$

The ratio of the intersection to the union of focal elements $A_i$ and $B_j$ expresses their similarity and is called the Jaccard coefficient.
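The distance can be sketched in a few lines of Python. The sketch assumes mass functions stored as dictionaries keyed by `frozenset` propositions and evaluates the quadratic form only over the propositions that carry mass in either function, which gives the same value as the full $2^N$-dimensional product; the names are illustrative.

```python
import math

def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two propositions."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def jousselme_distance(m1, m2):
    """sqrt(0.5 * (m1 - m2)^T D (m1 - m2)) with D given by Jaccard coefficients."""
    focal = set(m1) | set(m2)
    diff = {d: m1.get(d, 0.0) - m2.get(d, 0.0) for d in focal}
    quad = sum(diff[a] * diff[b] * jaccard(a, b) for a in focal for b in focal)
    return math.sqrt(max(quad, 0.0) / 2.0)

A, B = frozenset("a"), frozenset("b")
AB = A | B
d_opposite = jousselme_distance({A: 1.0}, {B: 1.0})
d_partial = jousselme_distance({A: 1.0}, {AB: 1.0})
d_same = jousselme_distance({A: 1.0}, {A: 1.0})
```

Two categorical masses on disjoint classes are at the maximal distance of 1, whereas a categorical mass and total ignorance over two classes are closer because their propositions overlap.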

The inclusion degree
Martin defines two inclusions [36] to measure how one mass function is contained in another: strict inclusion and light inclusion. Under strict inclusion, the mass function $m_a$ is contained in $m_b$ if all focal elements of $m_a$ are contained in every focal element of $m_b$. The strict degree of inclusion of $m_a$ in $m_b$ is given by

$$\delta_S(m_a, m_b) = \frac{1}{|F_a|\,|F_b|} \sum_{A \in F_a} \sum_{B \in F_b} \mathrm{Inc}(A, B),$$

where the sums run over the focal elements $A \in F_a$ of $m_a$ and $B \in F_b$ of $m_b$, and $\mathrm{Inc}(A, B)$ equals 1 if $A \subseteq B$ and 0 otherwise. On the basis of strict inclusion, Hoarau et al. define a new type of inclusion called fair inclusion. The mass function $m_a$ is fairly included in $m_b$ if all focal elements of $m_a$ on $2^\Theta \setminus \Theta$ are, in turn, included in each focal element of $m_b$ on $2^\Theta \setminus \Theta$. Its degree is defined analogously over $L_a$ and $L_b$, which represent the sets of focal elements of $m_a$ and $m_b$ on $2^\Theta \setminus \Theta$, respectively. Fair inclusion is slightly less strict than strict inclusion, because ignorance is contained only in itself.
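A minimal Python sketch of the strict and light degrees of inclusion is given below. It assumes the usual reading of Martin's definitions: the strict degree averages the subset indicator over all pairs of focal elements, and the light degree counts the focal elements of the first mass function that are contained in at least one focal element of the second. Function names and the dictionary representation are illustrative.

```python
def focal_elements(mass):
    """Propositions with strictly positive mass."""
    return [d for d, w in mass.items() if w > 0]

def strict_inclusion(m_a, m_b):
    """Share of focal-element pairs (A, B) with A a subset of B."""
    fa, fb = focal_elements(m_a), focal_elements(m_b)
    hits = sum(1 for a in fa for b in fb if a <= b)
    return hits / (len(fa) * len(fb))

def light_inclusion(m_a, m_b):
    """Share of focal elements of m_a contained in at least one of m_b."""
    fa, fb = focal_elements(m_a), focal_elements(m_b)
    hits = sum(1 for a in fa if any(a <= b for b in fb))
    return hits / len(fa)

A, B = frozenset("a"), frozenset("b")
AB = A | B
m_c = {A: 0.5, B: 0.5}
m_d = {A: 0.5, AB: 0.5}
strict_cd = strict_inclusion(m_c, m_d)  # A⊆A, A⊆AB, B⊆AB hold; B⊆A does not
light_cd = light_inclusion(m_c, m_d)    # both A and B fit inside AB
```

The example shows why light inclusion is the weaker requirement: `m_c` is fully lightly included in `m_d`, but only three of the four focal-element pairs satisfy the subset relation required by strict inclusion.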
Building on the integration of the above techniques, we improved the method to construct a pseudouridine site recognition model with better classification performance.

Nucleotide chemical property
Chemical structures and properties vary among nucleotides. Table 1 illustrates how the four nucleotides may be grouped into three distinct groups based on their chemical characteristics.
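As a hedged illustration, the following Python sketch encodes an RNA sequence with the nucleotide chemical property (NCP) scheme. Because the contents of Table 1 are not reproduced here, the sketch assumes the commonly used convention in which each nucleotide is described by three binary properties (ring structure, functional group, hydrogen bonding), giving A = (1, 1, 1), C = (0, 1, 0), G = (1, 0, 0) and U = (0, 0, 1). The function name is illustrative.

```python
# Assumed NCP convention: (ring structure, functional group, hydrogen bond).
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}

def ncp_encode(sequence):
    """Concatenate the three-bit chemical property code of every nucleotide."""
    features = []
    for base in sequence.upper().replace("T", "U"):
        features.extend(NCP[base])
    return features

encoded = ncp_encode("AU")
window = ncp_encode("ACGUACGUACUACGUACGUAC")  # a 21-nt toy window
```

Under this convention a 21-nucleotide sample yields 63 features and a 31-nucleotide sample yields 93 features.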

Pseudo k-tuple composition
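The PseKNC feature is a pseudo nucleic acid composition feature that takes into account both local and long-range sequence information. PseKNC is defined as follows:

$$D = [d_1, d_2, \ldots, d_{4^k}, d_{4^k+1}, \ldots, d_{4^k+\lambda}]^T,$$

where

$$d_\alpha = \begin{cases} \dfrac{f_\alpha}{\sum_{i=1}^{4^k} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 1 \le \alpha \le 4^k, \\[2ex] \dfrac{\omega\, \theta_{\alpha - 4^k}}{\sum_{i=1}^{4^k} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 4^k < \alpha \le 4^k + \lambda, \end{cases}$$

where $f_\alpha$ ($\alpha = 1, 2, \ldots, 4^k$) represents the frequency of oligonucleotides, $\omega$ represents the weight factor, $\lambda$ is the number of correlation tiers, $L$ is the sequence length, and $\theta_n$ is defined as follows:

$$\theta_n = \frac{1}{L - n - 1} \sum_{m=1}^{L-n-1} \Theta(R_m R_{m+1}, R_{m+n} R_{m+n+1}), \qquad n = 1, 2, \ldots, \lambda,$$

where $\Theta(\cdot, \cdot)$ is the correlation function, which is defined as

$$\Theta(R_m R_{m+1}, R_{m+n} R_{m+n+1}) = \frac{1}{\sigma} \sum_{\xi=1}^{\sigma} \left[ P_\xi(R_m R_{m+1}) - P_\xi(R_{m+n} R_{m+n+1}) \right]^2,$$

where $\sigma$ represents the number of physicochemical indices and $P_\xi(R_m R_{m+1})$ is the value of the $\xi$th ($\xi = 1, 2, \ldots, \sigma$) physicochemical index of the dinucleotide $R_m R_{m+1}$ at position $m$.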
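The structure of the PseKNC vector can be sketched in Python as follows. The sketch assumes the standard PseKNC form: k-mer frequencies followed by λ weighted correlation factors, all divided by a common normalizer. The dinucleotide property table `toy_props` contains made-up placeholder values, not real physicochemical indices, and all names are hypothetical.

```python
from itertools import product

def kmer_frequencies(seq, k):
    """Frequencies of all 4^k oligonucleotides, in lexicographic ACGU order."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[km] / total for km in kmers]

def correlation_factor(seq, n, props):
    """theta_n: mean squared property difference of dinucleotides n apart."""
    terms = []
    for m in range(len(seq) - n - 1):
        d1, d2 = seq[m:m + 2], seq[m + n:m + n + 2]
        squared = [(p - q) ** 2 for p, q in zip(props[d1], props[d2])]
        terms.append(sum(squared) / len(squared))
    return sum(terms) / len(terms)

def pseknc(seq, k, lam, weight, props):
    """4^k composition terms followed by lam pseudo (correlation) terms."""
    freqs = kmer_frequencies(seq, k)
    thetas = [correlation_factor(seq, n, props) for n in range(1, lam + 1)]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]

# Placeholder property values (NOT real physicochemical indices).
toy_props = {
    a + b: (float("ACGU".index(a)), float("ACGU".index(b)))
    for a, b in product("ACGU", repeat=2)
}
vector = pseknc("ACGUACGUAC", k=2, lam=2, weight=0.5, props=toy_props)
```

For k = 2 and λ = 2 the vector has 4² + 2 = 18 entries, and because every entry shares the same denominator, the entries sum to one.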

Evidence RF model based on fuzzy logic and kernel method
We construct an evidence RF model based on fuzzy logic and kernel methods, as shown in Figure 2. The specific construction process is as follows. First, because the dimension of the original feature set obtained after feature extraction is small, we follow the fuzzy-subset construction of TSK-FS to further improve the recognition performance of the subsequent model: the original feature set is processed by fuzzy logic to obtain multiple fuzzy subsets, and a new fuzzy feature set is obtained by merging all fuzzy subsets.
Then, the new fuzzy feature set is input into the kernel-based evidence RF classifier (KeERF) for training. The construction of KeERF is divided into the following two steps:
(i) Construct evidence decision trees by using the conflict concept in belief function theory.
(ii) Apply bagging and random feature selection to the evidence decision trees to obtain the evidence RF.

Algorithm 1 (continued):
Traverse $S_{train}$, count the number of times $x$ and each training sample $x_i$, $i = 1, 2, \ldots, m$, fall on the same terminal node, and calculate $K_{W,Z}(x, x_i)$;
13: Calculate the prediction function $F_{W,Z}(x, T_1, \ldots, T_w)$;

Table 2 (continued):
$C(m_a, m_b)$: The conflict measure of $m_a$ and $m_b$
$\mathrm{Info}(s)$: The information of node $s$
$\mathrm{Info}_{A_i}(s)$: The weighted sum of the information of the child nodes split on the attribute $A_i$
$\mathrm{Gain}(s, A_i)$: The split information gain on attribute $A_i$
$m(D)$: The mass function value of class $\omega$ in proposition $D$
$\mathrm{BetP}(\omega)$: The pignistic probability of the mass function
$K_{W,Z}(x, x_i)$: The posterior probability
$F_{W,Z}(x, T_1, \ldots, T_w)$: The prediction function
$\hat{Y}$: The predicted label value
Finally, the kernel method is used to predict the category labels of unclassified samples. The FKeERF method is summarized in detail in Algorithm 1. The symbols used in the following formulas are described in Table 2.

Evidential decision trees
Unlike traditional decision trees, the evidence decision tree proposed by Hoarau et al. improves the node splitting criterion [44].
They propose using a conflict measure based on inclusion and the Jousselme distance as the splitting criterion. First, based on fair inclusion, the degree of inclusion of $m_a$ and $m_b$ is defined as

$$\delta_N(m_a, m_b) = \max\left(\delta_{a \subseteq b}(m_a, m_b),\ \delta_{b \subseteq a}(m_b, m_a)\right).$$

This degree gives the maximum proportion of focal elements of one mass function that are included in the other. The conflict measure of $m_a$ and $m_b$ is expressed as

$$C(m_a, m_b) = \left(1 - \delta_N(m_a, m_b)\right) d_J(m_a, m_b),$$

which is used as the node splitting criterion in the evidence decision tree. The information $\mathrm{Info}(s)$ of a node $s$ is defined from this conflict measure, where node $s$ is a set of observations. Let $\mathrm{Attr} = \{A_1, A_2, \ldots, A_m\}$ be its attribute set; then the split information gain $\mathrm{Gain}(s, A_i)$ on attribute $A_i$ is

$$\mathrm{Gain}(s, A_i) = \mathrm{Info}(s) - \mathrm{Info}_{A_i}(s),$$

where $\mathrm{Info}(s)$ is the information of the node $s$ and $\mathrm{Info}_{A_i}(s)$ is the weighted sum of the information of the child nodes split on the attribute $A_i$, which is expressed as

$$\mathrm{Info}_{A_i}(s) = \sum_{v} \frac{|s_v|}{|s|}\, \mathrm{Info}(s_v),$$

where $s_v$ is the subset of $s$ for which attribute $A_i$ has the value $v$. The attribute that maximizes the gain function is selected for splitting a node. Splits are carried out recursively for every child node, beginning at the root node, until the stopping criteria are satisfied.
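The gain-based attribute selection can be sketched generically in Python. The node information function is passed in as a parameter; the `label_disagreement` function used here is only a simple placeholder (the share of observation pairs with different labels) and is not the conflict-based Info(s) of the evidence decision tree. All names are illustrative.

```python
def split_gain(node, attribute, info):
    """Gain(s, A) = Info(s) - sum_v |s_v| / |s| * Info(s_v)."""
    groups = {}
    for obs in node:
        groups.setdefault(obs[attribute], []).append(obs)
    weighted = sum(len(g) / len(node) * info(g) for g in groups.values())
    return info(node) - weighted

def best_attribute(node, attributes, info):
    """Attribute with the largest split information gain."""
    return max(attributes, key=lambda a: split_gain(node, a, info))

def label_disagreement(node):
    """Placeholder information: share of ordered pairs with different labels."""
    n = len(node)
    return sum(p["y"] != q["y"] for p in node for q in node) / (n * n)

node = [
    {"a": 0, "b": 0, "y": 0},
    {"a": 0, "b": 1, "y": 0},
    {"a": 1, "b": 0, "y": 1},
    {"a": 1, "b": 1, "y": 1},
]
gain_a = split_gain(node, "a", label_disagreement)
gain_b = split_gain(node, "b", label_disagreement)
chosen = best_attribute(node, ["a", "b"], label_disagreement)
```

Attribute `a` separates the two labels perfectly, so its child nodes carry no information and the gain equals the information of the parent node; attribute `b` leaves both children as mixed as the parent and yields zero gain.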
Once the tree is constructed, a new observation will pass through the tree from the root based on the value of its attribute.The observation will be given a mass function equal to the node's average mass function once a leaf is reached.
In evidence theory, decisions are made by converting mass functions into pignistic probabilities. Therefore, the class that maximizes the pignistic probability $\mathrm{BetP}(\omega)$ of the mass function is the predicted class:

$$\hat{\omega} = \arg\max_{\omega \in \Theta} \mathrm{BetP}(\omega), \qquad \mathrm{BetP}(\omega) = \sum_{D \subseteq \Theta,\ \omega \in D} \frac{m(D)}{|D|}.$$
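A short Python sketch of the pignistic decision is given below. It assumes the standard pignistic transform, in which the mass of each proposition is shared equally among the classes it contains, and a normalized mass function with no mass on the empty set. The class names and function names are illustrative.

```python
def pignistic(mass):
    """BetP(w): sum over propositions D containing w of m(D) / |D|."""
    bet = {}
    for proposition, weight in mass.items():
        for cls in proposition:
            bet[cls] = bet.get(cls, 0.0) + weight / len(proposition)
    return bet

def predict_class(mass):
    """Class with the largest pignistic probability."""
    bet = pignistic(mass)
    return max(bet, key=bet.get)

leaf_mass = {
    frozenset({"pos"}): 0.5,
    frozenset({"neg"}): 0.2,
    frozenset({"pos", "neg"}): 0.3,  # ignorance shared by both classes
}
bet = pignistic(leaf_mass)
label = predict_class(leaf_mass)
```

Here the 0.3 mass on the whole frame is split evenly, giving 0.65 for the positive class and 0.35 for the negative class.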

Evidence RF based on kernel method
Based on the evidence decision tree, bagging and random feature selection are used to construct the evidence RF. After the evidence RF is constructed, unclassified samples are not input directly into the RF to obtain their labels as in traditional prediction methods; instead, the prediction is carried out using the kernel method [45,46]. First, the posterior probability $K_{W,Z}(x, x_i)$ that the unclassified sample $x$ and each training sample in $X_{train} = \{x_1, x_2, \ldots, x_i, \ldots, x_Q\}$ fall on the same leaf node in the evidence RF is calculated:

$$K_{W,Z}(x, x_i) = \frac{1}{W} \sum_{w=1}^{W} \mathbb{1}_{x_i \in I_{z(x, T_w)}},$$

where $W$ is the number of evidence decision trees in the evidence RF, $T_w$ represents the split mode of the evidence decision tree in the $w$th iteration, and $I_{z(x, T_w)}$ is defined as the node containing $x$ in the evidence RF, which is determined by $T_w$ and the training set $X_{train}$. When $x_i$ and $x$ fall on the same leaf node of the evidence decision tree, the value of $\mathbb{1}_{x_i \in I_{z(x, T_w)}}$ is equal to 1.
Then, the prediction function $F_{W,Z}(x, T_1, \ldots, T_w)$ is used to obtain an approximate prediction value. Finally, according to the approximate predicted value, the label is predicted by using the sign function:

$$\hat{Y} = \mathrm{sgn}\left(F_{W,Z}(x, T_1, \ldots, T_w)\right).$$
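The kernel prediction step can be sketched in Python under stated assumptions: each sample is represented by the list of leaf identifiers it reaches in the W trees, labels are coded as +1 and −1, and the prediction function is taken to be the kernel-weighted average of training labels, as in kernel random forest estimators. The exact form of F used in FKeERF is not given in the text, so this weighting is an assumption, and all names are hypothetical.

```python
def forest_kernel(leaves_x, leaves_xi):
    """Fraction of trees in which two samples fall on the same leaf."""
    shared = sum(a == b for a, b in zip(leaves_x, leaves_xi))
    return shared / len(leaves_x)

def kernel_predict(leaves_x, train_leaves, train_labels):
    """Kernel-weighted label average (assumed form of F) and its sign."""
    weights = [forest_kernel(leaves_x, leaves) for leaves in train_leaves]
    total = sum(weights)
    if total == 0:
        return 0.0, 0  # no training sample shares a leaf with x
    score = sum(w * y for w, y in zip(weights, train_labels)) / total
    label = 1 if score > 0 else -1 if score < 0 else 0
    return score, label

# Leaf identifiers reached in a toy forest of three trees.
train_leaves = [[1, 2, 3], [1, 2, 4], [5, 6, 7]]
train_labels = [1, 1, -1]
score, label = kernel_predict([1, 2, 7], train_leaves, train_labels)
```

The query sample shares two of three leaves with each positive training sample and one leaf with the negative one, so the weighted average is (2/3 + 2/3 − 1/3) / (5/3) = 0.6 and the sign function returns the positive label.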

RESULTS
The following experiments were run on a server equipped with an Intel Xeon Platinum 8168 CPU and 1.0 TB of memory. We trained the model using the Python programming language and ran the experiments on the Ubuntu 16.04.1 LTS operating system.

Datasets
The benchmark dataset collected from RMBase [47] by Chen et al. [22] is used for comparison, as shown in Table 3. It consists of three training datasets, H_990, S_628 and M_944, and two separate test datasets, H_200 and S_200. The sample sizes of the three training datasets are 990, 628 and 944, respectively, with half of each dataset containing pseudouridine sites and the other half not. Both independent testing datasets contain 200 samples, half of which contain pseudouridine sites. In both the H. sapiens datasets (H_990 and H_200) and the M. musculus dataset (M_944), all samples are RNA sequences of 21 nucleotides with uridine positioned at the center. In the S. cerevisiae datasets (S_628 and S_200), the RNA sequences contain 31 nucleotides with uridine at the central position. All datasets used in the experiments are existing datasets. To ensure a valid comparison of results, we used the same datasets as the other models in the comparison.

Evaluation metrics
This study presents a new method for identifying pseudouridine sites. We evaluated the performance of our method using seven metrics widely used in previous studies: specificity (SP), sensitivity (SN), accuracy (ACC), Matthews correlation coefficient (MCC), F1 score, area under the precision-recall curve (AUPR) and area under the receiver operating characteristic curve (AUC) [48][49][50][51][52][53][54][55]. The AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axis; for a classifier that performs no worse than random guessing, its value lies between 0.5 and 1. The closer the AUC is to 1.0, the more reliable the detection method. The formulas for the threshold-based metrics are as follows:

$$\mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \qquad \mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN},$$

where TP represents true positives, FP represents false positives, TN represents true negatives and FN represents false negatives. We use these metrics to evaluate the performance of our method and to guide our experiments. An independent testing experiment applies a model trained on a training dataset to an independent testing dataset to evaluate the generalization ability of the model.
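For concreteness, the threshold-based metrics can be computed from the four confusion-matrix counts as in the following Python sketch. The counts in the example are hypothetical and are not results from this study; the function name is illustrative.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """SN, SP, ACC, MCC and F1 from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc, "F1": f1}

# Hypothetical counts for a balanced test set of 200 samples.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

AUC and AUPR are not covered by this sketch because they are computed from ranked prediction scores rather than from a single confusion matrix.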

Performance evaluation of different feature extraction schemes
In this section, we investigate and evaluate the performance of seven feature encodings, namely Binary, NCP, ENAC, PseKNC, EIIP, PSTNPss and ANF, by conducting 10-fold cross-validation tests, and identify the best combination of features. First, we extracted features from the training sets of the three species according to each of the seven coding schemes, and then selected RF as the classifier for 10-fold cross-validation tests to obtain the performance of each coding scheme, as shown in Figure 3. A comprehensive analysis of the results on the three training sets shows that PSTNPss has the best performance, followed by Binary, NCP, ENAC, PseKNC and EIIP, and that ANF has the worst performance. Therefore, we chose the six encoding schemes other than ANF for feature combination. According to the performance of each coding scheme in the previous stage, we designed four feature combination modes and input them into the RF classifier for testing. The performance of the four feature combination schemes is shown in Figure 4. A comprehensive analysis shows that the combination of Binary, PSTNPss, NCP and PseKNC has the best performance. After testing single features and feature combinations, we selected the combination of Binary, PSTNPss, NCP and PseKNC for subsequent experiments.

Parameter optimization and analysis
Hyperparameters affect the recognition performance of the model, and other advanced methods optimize their hyperparameters to some extent, as shown in Supplementary Table S2. We discuss the role of three hyperparameters in PseU-FKeERF: (i) the number of clusters in fuzzy c-means clustering; (ii) the number of decision trees in the forest; (iii) the minimum number of samples at a leaf node. Parameter analysis experiments were carried out on the training datasets of the three species.
In the process of constructing the fuzzy feature set, fuzzy c-means clustering must be carried out first. The number of clusters determines the number of fuzzy subsets and is an important hyperparameter that affects the performance of the site recognition model. Therefore, we tested cluster numbers from 3 to 10 at an interval of 1. Figure 5 shows the performance of the model under different cluster numbers when the number of decision trees is 100 and the minimum number of leaf node samples is 5. As shown in the figure, when the cluster number is set to 3, our model achieves high performance on the H. sapiens and S. cerevisiae training sets. For the M. musculus training set, the performance is better when the number of clusters is 8.
We then used grid search to estimate the effects of the two additional hyperparameters: the number of decision trees, searched from 80 to 220 at intervals of 20, and the minimum number of leaf node samples, searched from 3 to 10 at intervals of 1. Figure 6 shows the performance for different values of the two hyperparameters when the cluster number is 3. As shown in the figure, when the number of decision trees is 140 and the minimum number of leaf node samples is 10, the model performs best on the H. sapiens training set. For S. cerevisiae, the accuracy of the model is highest when the number of decision trees is 120 and the minimum number of leaf node samples is 9. The M. musculus training set has the best performance when the number of decision trees is 200 and the minimum number of leaf node samples is 6.
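The grid described above can be enumerated with a short Python sketch. The scoring function is a hypothetical stand-in for the cross-validated performance of the model; it is constructed to peak at 140 trees and a minimum leaf size of 10 purely for illustration and does not reproduce any experimental result.

```python
from itertools import product

def grid_search(evaluate, tree_counts, leaf_sizes):
    """Return (score, n_trees, min_leaf) for the best-scoring grid point."""
    best = None
    for n_trees, min_leaf in product(tree_counts, leaf_sizes):
        score = evaluate(n_trees, min_leaf)
        if best is None or score > best[0]:
            best = (score, n_trees, min_leaf)
    return best

tree_counts = range(80, 221, 20)  # 80, 100, ..., 220
leaf_sizes = range(3, 11)         # 3, 4, ..., 10

def toy_score(n_trees, min_leaf):
    """Hypothetical placeholder for a cross-validated accuracy."""
    return -abs(n_trees - 140) / 100 - abs(min_leaf - 10) / 10

best = grid_search(toy_score, tree_counts, leaf_sizes)
```

The grid contains 8 × 8 = 64 parameter pairs, each of which would require one full cross-validation run when the placeholder is replaced by a real evaluation.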
In general, based on the above analysis of experimental results, we determined the optimal parameter setting of the PseU-FKeERF model on each species dataset, including the number of clusters, the number of decision trees and the number of minimum leaf node samples, which are crucial for achieving the best model performance.

Ablation study
To verify the contribution and validity of each module in the FKeERF method, we conducted ablation studies on the training sets and independent testing sets of the three species. We performed the ablation experiments by removing the kernel-based category prediction module and the fuzzy subset module. Specifically, the first method, ERF, includes neither the kernel-based category prediction module nor the fuzzy subset module; that is, it is the original evidence RF method. The second method, KeERF, adds only the kernel-based category prediction module and uses the kernel method to predict the category label of each unclassified sample. The third method, FKeERF, adds the fuzzy subset module to KeERF. By combining fuzzy logic rules, multiple fuzzy subsets are fused to form the fuzzy feature set and expand the original feature space.
The test results are shown in Tables 4 and 5, respectively. Compared with the other two methods, FKeERF has the best performance. The comparison of the first two methods shows that removing the kernel-based category prediction module reduces the prediction performance of the model, which indicates that introducing the kernel method for category label prediction helps to improve prediction performance. For the ablation of the fuzzy subset module, the results show that FKeERF, which expands the original feature space by combining fuzzy logic rules, further improves prediction performance. In general, the ablation experiments show that both the kernel-based category prediction module and the fuzzy subset module are effective in improving prediction performance.

Comparison with traditional machine learning methods
We compare our model with RF, SVM, KNN, XGBoost and other methods to further confirm the performance of our approach, FKeERF. We tested the predictive performance of all classifiers on the training sets and independent testing sets of the three species, and the results are shown in Tables 6 and 7.
As can be seen from Table 6, FKeERF achieved the best overall performance on the three species in terms of ACC, MCC and AUC compared with the other four traditional classifiers on the same training datasets. For H. sapiens, FKeERF performed best in ACC, MCC and AUC, with values 3.44, 6.11 and 2.89% higher than those of the next best method, RF, respectively. XGBoost performed worse than RF but better than SVM, and KNN performed the worst. On the training set of S. cerevisiae, FKeERF obtained the best performance on ACC, MCC and AUC, 87.73, 75.89 and

Figure 1. The overall framework of PseU-FKeERF. There are four main steps, including dataset preparation, feature extraction, FKeERF model training and optimization, and performance evaluation.
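$F_a$ and $F_b$ are the sets of focal elements of $m_a$ and $m_b$, respectively, and $|F_a|$ and $|F_b|$ are the numbers of focal elements of $m_a$ and $m_b$. Light inclusion says that the mass function $m_a$ is contained in $m_b$ if all focal elements of $m_a$ are contained in at least one focal element of $m_b$. It is defined as follows:

$$\delta_L(m_a, m_b) = \frac{1}{|F_a|} \sum_{A \in F_a} \max_{B \in F_b} \mathrm{Inc}(A, B),$$

where $\mathrm{Inc}(A, B)$ equals 1 if $A \subseteq B$ and 0 otherwise.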

Figure 2. The flowchart of the FKeERF method. The FKeERF method has three parts: constructing the fuzzy feature set, constructing the evidence decision trees and category prediction. (A) By clustering the original feature set with fuzzy c-means, several clusters and the mean and variance of each cluster are obtained. Then, multiple fuzzy feature subsets are obtained by using the Gaussian membership function, and the fuzzy feature set is obtained by fusing the fuzzy feature subsets. (B) The fuzzy feature set is used to construct the evidence random forest. (C) The training set and test set samples are input into the evidence random forest, the number of times the two fall on the same node is counted, the kernel function and prediction function are used to obtain the prediction result, and the sign function is applied to obtain the final predicted label.

Algorithm 1 (continued):
14: Obtain the class of $x$ from the sign function $\hat{Y} = \mathrm{sgn}(F_{W,Z}(x, T_1, \ldots, T_w))$;
15: end for
16: return prediction labels $Y_{pre}$ for $D_{test}$.

Table 2: Description of symbols in FKeERF model formulas
Notation: Description
$\delta_{a \subseteq b}(m_a, m_b)$: The fair degree of inclusion
$\delta_N(m_a, m_b)$: The degree of inclusion of $m_a$ and $m_b$
$C(m_a, m_b)$: The conflict measure of $m_a$ and $m_b$

In the pignistic probability, $\omega$ represents the class, $m(D)$ is the mass function value of a proposition $D$ that contains class $\omega$, and $|D|$ represents the number of classes contained in proposition $D$.

Figure 3. Comparison of different feature extraction schemes.

Figure 4. Comparison of different feature extraction scheme combinations.

Figure 5. Performance comparison under different cluster numbers.

Figure 6. Parameter setting analysis of the number of decision trees and the minimum number of leaf node samples in the FKeERF model on the three species datasets.

Table 1: Chemical properties of the four nucleotide types

Algorithm 1: FKeERF
Input: A training set $D_{train}$, a testing set $D_{test}$, the number of clusters $K$, the number of estimators in the forest $N$ and the minimum number of samples at a leaf $L$.
Output: Prediction labels for unclassified samples $Y_{pre}$
1: Calculate the mean $C$ and variance $D$ of each cluster using FCM: Generate_fcm($D_{train}$, $K$);
2: Obtain fuzzy feature subsets $g_v$, $v = 1, 2, \ldots, K$, by fuzzy processing based on the Gaussian membership function: Calculate_gauss($D_{train}$, $D_{test}$, $C$, $D$);
3: Obtain the fuzzy feature set $G_{new}$ by merging the fuzzy feature subsets $g_v$, $v = 1, 2, \ldots, K$;
4: for $n$ in $1, \ldots, N$ do
Input all samples from $D_{train}$ into the ERF, record the forest terminal node each sample falls on, and store it in $S_{train}$.
10: for $x$ in $D_{test}$ do

Table 3: The information of benchmark datasets